17.4 Extrinsic Methods

257

database sequences and the consequences for deductive inference do not yet seem

to have been given the attention they deserve. In particular, there appears to be a

feeling associated with the “big data” movement that, provided one has enough data,

the errors will somehow be “averaged out” or “autocompensated”, although proper

justification for this notion is lacking.

17.4.2

Sequence Comparison and Alignment

The pairwise comparison of sequences is very widely used in bioinformatics. Evi-

dently, it is a subset of the general problem of pattern recognition (Sect. 13.1). If it

were only a question of finding matches to more or less lengthy blocks of symbol

sequences (e.g., the longest common subsequence; LCS), the task would be relatively

straightforward and the main work would be merely to assess the statistical signifi-

cance of the result; that is, compare with the null hypothesis that a match occurred

by chance (cf. Sect. 9.2.1). In reality, however, the two sequences one is trying to

compare differ due to mutations, insertions, and deletions (cf. Sect. 14.7.1), which

renders the problem considerably more complicated; one has to allow for gaps, and

one tries to make inferences from local alignments between subsequences. A typical

example of an attempt to align fragments of two nucleotide sequences is

upper I left parenthesis s Subscript a Baseline comma s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline comma s Subscript a Baseline right parenthesis equals upper I left parenthesis s Subscript a Baseline right parenthesis minus upper I left parenthesis s Subscript a Baseline vertical bar s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline right parenthesis minus upper I left parenthesis s Subscript b Baseline vertical bar s Subscript a Baseline right parenthesis period

A

C

G

T

A

C

G

T

A

G

T

|

|

|

|

|

|

|

|

A

C

A

T

G

T

A

C

G

T

where vertical lines indicate matches. Note the gaps that have been inserted to increase

the number of matches. In the absence of gaps, one could simply compute the Ham-

ming distance between two sequences; the introduction of the possibility of gaps

introduces two problems: (i) the number of possible alignments becomes very large

and (ii) where are gaps to be placed in sequence space?

If no gaps are allowed, one assigns and sums scores for all possible pairs of aligned

substrings within the two sequences to be matched. If gaps are allowed, there areStartBinomialOrMatrix 2 n Choose n EndBinomialOrMatrix

(2n

n

)

possible alignments of two sequences each of length nn. 16 Even for moderate values

ofnn, there are too many possibilities to be enumerated (problem (i), a computational

one). It is solved using dynamic programming algorithms (Sect. 17.4.4). Problem

(ii) is solved by devising a scoring system with which gaps and substitutions can be

assigned numerical values. Finally, one needs to assess the statistical significance of

the alignment. This is still an unsolved problem—let us call it problem (iii).

The essence of sequence alignment is to assign a score, or cost, for each possible

alignment; the one with the lowest cost, or highest score, is the best one, and if

16 This is obtained by considering the number of ways of intercalating two sequences while pre-

serving the order of symbols in each.